Analytical Data Processing Engine

ABSTRACT

Some examples include high-performance query processing of real-time and offline temporal-relational data. Further, some implementations include processing streaming data events by annotating individual events with a first timestamp (e.g., a “sync-time”) and a second timestamp that may identify additional event information. The stream of incoming data events may be organized into a sequence of data batches that each include multiple data events. The individual data batches in the sequence may be processed in a non-decreasing “sync-time” order.

BACKGROUND

Data analytics platforms may analyze large amounts of data in order to derive insights from the data. In some cases, efficient analysis of such large amounts of data may be difficult to perform in a cost-effective manner. Further, one set of technologies may be employed in the context of streaming data analytics while another set of technologies may be employed in the context of offline data analytics.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

Some implementations provide techniques and arrangements for high-performance query processing of real-time and offline temporal-relational data. In some cases, a stream of incoming data events may be received, and each individual data event may be annotated with a first timestamp (e.g., a “SyncTime”) and a second timestamp (e.g., an “OtherTime”). The first timestamp may identify when a particular data event in the stream of incoming data events is received, and the second timestamp may identify additional information associated with the particular data event.

To illustrate, in some implementations, SyncTime may denote the logical instant when a fact becomes known to the system. In the case of an interval [Vs, Ve], the event may arrive as a start-edge at timestamp Vs and an end-edge at timestamp Ve. Hence, SyncTime for the start-edge and end-edge may be set to Vs and Ve, respectively. In the case of an interval, OtherTime may provide some a priori knowledge of the future (i.e., that the event is known to end at some future time Ve). In the case of an end-edge, OtherTime may provide additional context to the event (i.e., the start time that the event is originally associated with).

In some implementations, the stream of incoming data events may be organized into a sequence of data batches that each include multiple data events. Processing of the individual data batches in the sequence may occur in a non-decreasing SyncTime order.

In some cases, representing timestamps as SyncTime and OtherTime may have several advantages. For example, both interval and edge events may be represented using two timestamp fields instead of three, thereby reducing the space overhead of events and reducing memory bandwidth usage during query processing. Further, because data is processed in SyncTime order, SyncTime may be frequently used in operator algorithms. By explicitly identifying the SyncTime rather than computing it on-demand, performance benefits may be achieved.

BRIEF DESCRIPTION OF THE DRAWINGS

The Detailed Description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.

FIG. 1 illustrates an example of a framework for query processing according to some implementations.

FIG. 2 is a block diagram illustrating an example of data organization within a data batch message, according to some implementations.

FIG. 3 illustrates comparative examples of several data organizations, according to some implementations.

FIG. 4 is a block diagram illustrating an example of data organization within a data batch message that includes multiple columnar payloads, according to some implementations.

FIG. 5 illustrates an example process flow for code-generation at query compile-time, according to some implementations.

FIG. 6 illustrates an example process flow for producing incremental snapshot-oriented results, according to some implementations.

FIGS. 7 and 8 illustrate example process flows for performing temporal join operations, according to some implementations.

FIG. 9 illustrates an example process flow for dynamically setting a batch-size, according to some implementations.

FIG. 10 illustrates an example process flow for grouping sub-queries, according to some implementations.

FIG. 11 illustrates an example computing device and system environment in which some implementations may operate.

DETAILED DESCRIPTION

The present disclosure relates to a query processing engine for real-time and offline temporal-relational data. In some embodiments, the query processing engine may be used as a streaming temporal analytics engine (referred to herein as a “streaming engine”) for high-performance processing of streaming data (e.g., with query processing speeds on the order of a trillion events per day in some cases). The streaming engine of the present disclosure utilizes a logical temporal model with particular design choices and restrictions at the physical level that may significantly increase the efficiency of the engine (e.g., by a factor of 100 to 1000) compared to existing streaming engines. Further, the temporal and data model of the present disclosure may be used for a diverse set of data analytics, including offline relational-style queries, real-time temporal queries, and progressive queries.

In the present disclosure, each tuple in the system has two associated logical timestamps to denote when the tuple enters computation and when the tuple leaves computation. Users can express a query “Q” over the timestamped data. Logically, the dataset may be divided into snapshots that are formed by the union of unique interval endpoints of all events in the system. The query Q is logically executed over every snapshot. A result is present in the output at timestamp “T” if it is the result of Q applied to data that is “alive at” (whose intervals are stabbed by) T. This logical model of streams may be referred to as the CEDR model (“Complex Event Detection and Response”).

The present disclosure describes using particular design choices and restrictions in the physical model in order to get high performance with the CEDR logical temporal model. For example, in the present disclosure, physical data events are annotated with two timestamps, a “SyncTime” timestamp and an “OtherTime” timestamp. In some implementations, both timestamps may be represented as long values (e.g., as 8-byte integers). A logical event with (logical) lifetime [Vs, Ve] may physically appear in two forms. In the first form, the logical event may physically appear as an interval event with a SyncTime of Vs and an OtherTime of Ve. In the second form, the logical event may physically appear in the form of two events: a start-edge event and an end-edge event. In this case, the start-edge event provides the start time and has a SyncTime of Vs and an OtherTime of ∞. The end-edge event has a SyncTime of Ve and an OtherTime of Vs. This type of logical event is different from an interval event because an interval event has an OtherTime that is not smaller than the SyncTime. That is, the three different kinds of events (interval, start, and end) can be distinguished by a computation on the two timestamps, without additional storage. Further, there may be a start-edge event that has no matching end-edge event. This represents data that enters the system and never expires.
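To make this classification concrete, the following is a minimal illustrative sketch (in C#; the names are illustrative and not part of the disclosure) of how the three event kinds may be distinguished from the two timestamps alone, assuming ∞ is encoded as long.MaxValue:

    // Illustrative sketch only: classify a physical event from its two
    // timestamps, assuming infinity is encoded as long.MaxValue.
    static class EventKindSketch
    {
        const long Infinity = long.MaxValue;

        public enum EventKind { Interval, StartEdge, EndEdge }

        public static EventKind Classify(long syncTime, long otherTime)
        {
            if (otherTime == Infinity) return EventKind.StartEdge; // OtherTime = ∞
            if (otherTime < syncTime) return EventKind.EndEdge;    // OtherTime = Vs < SyncTime = Ve
            return EventKind.Interval;                             // OtherTime = Ve >= SyncTime = Vs
        }
    }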

Thus, the SyncTime timestamp represents the logical instant when information becomes known to the system. In the case of an interval [Vs, Ve], the information arrives as a start-edge at timestamp Vs and an end-edge at timestamp Ve. Hence, in the case of an interval event, the SyncTime for the start-edge is set to Vs and the SyncTime for the end-edge is set to Ve. The timestamp OtherTime represents additional information. In the case of an interval, the OtherTime timestamp provides some a priori knowledge of the future (i.e., that the event is known to end at some future time Ve). In the case of an end-edge, the OtherTime timestamp provides additional context to the event (i.e., the start time that the event is originally associated with).

In the present disclosure, incoming data streams are grouped. In other words, for every payload there exists a key that identifies a logical substream within the original stream. Keys (and their hash values) are materialized as part of events to avoid the expense associated with re-computation.

The present disclosure constrains data to be ingested and processed in non-decreasing SyncTime order. In other words, each operator processes and produces tuples in this non-decreasing SyncTime order. These data constraints result in various operator algorithm optimizations described further below. Further, representing timestamps as SyncTime and OtherTime has several advantages. For example, both interval and edge events may be represented using two timestamp fields instead of three timestamps, resulting in a reduction of the space overhead of events and a reduction in memory bandwidth usage during query processing. In addition, having the SyncTime explicit rather than computed on-demand may provide further performance benefits.

Given that data is timestamped and exists in SyncTime order, every event essentially also represents a punctuation in terms of guaranteeing that future events will not have a lower sync-time. However, in the system of the present disclosure, explicit punctuations may still be used for several purposes. As one example, a punctuation may indicate the passage of time in case there are lulls in the input (i.e., periods where there is no incoming data). As another example, the query processing engine may internally batch events before sending the information to a next operator. In this case, punctuations may serve to prompt the system into producing an output, which may involve closing out and pushing partially filled data batches to the next operator. Thus, in the system of the present disclosure, punctuations may expose and exploit the tradeoff between throughput and latency. The system may batch events and process them as batches for improved performance. While this may add to latency, the user may issue a punctuation to “kick” the system into flushing partial batches and producing output immediately.

The query processing engine of the present disclosure materializes grouping information as part of events. To illustrate the logical groups of streams, a stream that is grouped by key K logically represents n distinct sub-streams, one for every distinct value of the grouping key K. An ungrouped stream is represented as a stream grouped by Unit (an empty struct). There is a single timestamp domain across all groups. In other words, passage of time occurs across all groups and does not occur on a per-group basis. The grouping key K and its hash value H are materialized and stored as part of events. In an alternative implementation, the grouping key K and/or hash value H may not be materialized. Instead, the expression to compute the associated values may be cached and the grouping key K and/or hash value H may be computed on-demand.

Further, the present disclosure may provide a “one size fits many” approach to data analytics. That is, the temporal model of the present disclosure may support a wide range of analytics, including not only real-time temporal queries, but also offline temporal queries, relational (atemporal) queries, and progressive queries. The present disclosure describes a “pay-as-you-go” approach that allows a single streaming engine relying on a general model to be used for each of these diverse use cases without sacrificing performance. In general, the approach adheres to the principles that simple operations may be “very” fast, that one may not pay for temporality unless temporality is necessary for the query, that one may not pay for progressiveness (approximate results) unless it is needed, and that operations may be “very” fast for simple data types, while degrading gracefully as types become complex.

In general, in order to improve query processing performance on a single machine, the amount of data transferred between a main memory subsystem and a central processing unit (CPU) may be minimized. The streaming engine of the present disclosure uses a “bottom-up” or “emergent” architecture and design in order to remove bottlenecks that may be associated with existing streaming engines. Developing the emergent architecture of the present disclosure included initially determining a speed associated with a pass-through query and subsequently adding one operator after another while being careful not to degrade existing performance, in order to avoid any single bottleneck in the system. The result is an architecture that carefully separates out two kinds of work done by a streaming engine. One kind of work is referred to herein as “fine-grained work” and represents work that is done for every event in the system, while the other kind of work is referred to herein as “coarse-grained work” and represents work that is performed infrequently and that may be amortized across multiple events. Separating the two kinds of work allows for a design where the unit of data processing at the engine level is a “message” (i.e., a single unit that can consist of a batch of physical events). The engine of the present disclosure operates over a sequence of messages as input.

In the present disclosure, a query plan is organized as a graph of operators (a directed acyclic graph (“DAG”) of operators), with each operator receiving and generating messages. Operators perform fine-grained work in carefully written tight loops that leverage data locality, instruction locality, compiler loop-unrolling, unsafe code, code generation, and other optimizations that are discussed further below. Coarse-grained work is performed by the engine and operators at message boundaries. This coarse-grained work may include handing off data across cores, allocating messages, and updating memory pools.

Each operator receives a batch, processes events within the batch, and writes output events into an output batch. The output batch is not handed off to a next downstream operator until the batch is full, or a control event (punctuation) is received. This exposes a throughput/latency tradeoff that can be exploited, as described further below.

EXAMPLE IMPLEMENTATIONS

FIG. 1 illustrates an example framework 100 for high-performance query processing according to some implementations.

In FIG. 1, input data sources 102 (for example, as observables) may be used to build a logical query plan 104 that may then be connected to one or more outputs 106 (for example, observables). In some implementations, the query processing engine may include technology (e.g., .NET technology, etc.) that may be available as a library (i.e., no heavy-weight server). This may allow the use of .NET types and expressions in specifying a query 108 for processing. In some cases, the query 108 may be written using LINQ (for “language integrated query”).

During creation of the logical query plan 104, query compile-time work such as property derivation may also be performed. When subscribing to an output 106, a Subscribe( ) call may propagate through the query plan 104, resulting in the construction of a physical plan 110 and an associated physical DAG of operators 112 (this includes the creation and instantiation of code-generated operators). Input observers 114 feed content in the form of messages (e.g., via On( ) messages). For example, FIG. 1 illustrates an example of a data batch message 116 and a control message 118 being provided by the input observers 114. This triggers query processing and results in output (e.g., as messages) being pushed to result observers 120 as the output is produced.

As described further below, in some cases, operators may be supported by memory pool management interfaces to memory pools 122 that allow re-use of messages and data structures 124 within the engine in order to reduce the cost of .NET allocation and garbage collection. As described further below, this may also provide fast data structures such as FastDictionary and blocking concurrent queues to enable high performance. In some cases, a scheduler 126 may allow for techniques to effectively scale out computation to multiple cores using a special version of GroupApply (as described below) or using a more general scheduler design for operators.

In the system of the present disclosure, data is organized as a sequence of messages. A message can be of multiple types. For example, a “DataBatch” message (e.g., the message 116 in FIG. 1) may include a bundle of physical data events that are organized for high performance query processing. In the present disclosure, the notion of batches is physical. That is, the output of the query 108 (also called the logical result) is unaffected by batching. Rather, physical aspects (such as “when” output is produced) are affected by batching. As another example, a “Punctuation” message (e.g., the control message 118 in FIG. 1) may include a special control event that is associated with a timestamp. As a further example, a “Completed” message (e.g., the control message 118 in FIG. 1) may include a special control event that indicates the end of a stream.

Referring to FIG. 2, an example of the organization of data within a DataBatch message (e.g., the DataBatch message 116 of FIG. 1) is illustrated and generally designated 200.

In the system of the present disclosure, the stream of incoming data events (e.g., from the input data sources 102 in FIG. 1) may be organized into a sequence of batches. Each batch may include a large number of data events (e.g., up to “BatchSize” events). Each DataBatch message 116 stores the control parameters in columnar format, as shown in FIG. 2. Each DataBatch message 116 may include multiple arrays, including a SyncTime array 202, an OtherTime array 204, a Bitvector array 206, a Key array 208, a Hash array 210 and a Payload array 212.

The SyncTime array 202 is an array of the SyncTime values of all events in the batch. The OtherTime array 204 is an array of the OtherTime values of all events in the batch. The Bitvector array 206 is an occupancy vector, where the array includes one bit per event. A bit value of 0 may indicate that the corresponding data event exists, while a bit value of 1 may indicate that the event does not exist (is absent). The Bitvector array 206 may allow efficient operator algorithms in some cases by avoiding the unnecessary movement of data to/from main memory. For example, a “Where” operator can apply a predicate, and if the predicate fails, the corresponding bitvector entry in the Bitvector array 206 may be set to 1. The Key array 208 may be an array of the grouping key values of the data events within the DataBatch message 116. Similar to payloads, the grouping keys in DataBatches may also be types with multiple fields. In this case, the key fields may be represented as a set of arrays, one per field in the grouping key. The Hash array 210 is an array of hash values (e.g., 4-byte integers), each associated with a particular key. Keys and their associated hashes may be pre-computed and materialized to avoid the expensive task of the operators computing these values.

The Payload array 212 is an array of all the payloads within the DataBatch message 116. In some implementations, the system of the present disclosure may support arbitrary .NET types as payloads. Further, the system of the present disclosure supports two modes for payloads, a row-oriented mode and a column-oriented mode. During query processing, one may switch from the column-oriented mode to the row-oriented mode or vice versa, based on the types and operators in the query plan. In both cases, a user view of data is row-oriented. In other words, users may write queries assuming a row-oriented view of data. In FIG. 2, an example payload organization 214 illustrates that, in a row-oriented mode, the payload 212 is simply an array of “TPayload” (i.e., the type of the payload 212).
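For illustration, the control fields and row-oriented payload of a DataBatch may be sketched as follows (a minimal sketch in C#; the member names mirror the arrays described above but are otherwise illustrative):

    // Illustrative sketch of a row-oriented DataBatch layout.
    class DataBatch<TKey, TPayload>
    {
        public long[] SyncTime;     // per-event sync timestamps
        public long[] OtherTime;    // per-event other timestamps
        public long[] Bitvector;    // occupancy: a set bit means the event is absent
        public TKey[] Key;          // materialized grouping keys
        public int[] Hash;          // pre-computed 4-byte key hashes
        public TPayload[] Payload;  // payloads, row-oriented
        public int Count;           // number of event slots in use
    }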

To illustrate how columnar payload organization is supported, FIG. 3 illustrates an example payload data type 302 that a user is using to write their queries over. In the example of FIG. 3, the payload data type 302 relates to “TitanicPassenger” data and includes five representative fields for illustrative purposes only. In this example, a first field 304 (“survived”) identifies whether a passenger did or did not survive the Titanic disaster (i.e., true or false), a second field 306 (“pclass”) identifies a passenger class, a third field 308 (“name”) identifies a passenger name, a fourth field 310 (“sex”) identifies a passenger sex, while a fifth field 312 (“age”) identifies a passenger age. It will be appreciated that it is typically more common for these types to have many more fields than the five fields shown in this example.

In the row-oriented mode, the DataBatch message 116 contains an array of values of this type. As shown, each value is an instance of a class, so what is actually kept in the array is an address of a location in a heap where the different fields are stored in contiguous memory (except for the strings, which themselves are heap-allocated objects, so a string-valued field is itself a pointer to somewhere else in memory). This may not provide the data locality desired for high-performance computing. If the type is a “struct” instead of a class, then the values of the fields for each instance are kept within the array itself. This may improve the data locality, but it may often be the case that a particular query accesses only a small subset of the fields (and, as noted above, it is also much more common for these types to have many more fields than the five shown in this example). As such, in executing the query, even though the array of payloads may be accessed in an efficient manner, the sparse access to the fields within each payload means that there is still insufficient data locality.

The present disclosure illustrates that efficiency may be improved by organizing the data as column-oriented data, where there is an array of values for each field. In FIG. 3, a batched-columnar version of the data type is illustrated and generally designated 314. To illustrate, in the batched-columnar version 314, there is an array (identified as “[ ]” in FIG. 3) of “survived” values for the first field 304, an array of “pclass” values for the second field 306, an array of “name” values for the third field 308, an array of “sex” values for the fourth field 310, and an array of “age” values for the fifth field 312. With the data organized as column-oriented, where there is an array of values for each field, the implementation of a query may access contiguous elements of the arrays of the fields mentioned in the query to provide a higher level of data locality. Therefore, in some implementations, a column-oriented definition may be generated for each type that the queries are expressed over.

In the columnar data batch organization, the organization of data as a sequence of data batches with columnar control fields and payload fields may improve the speed of serialization and de-serialization of data to and from external locations (such as a local/remote disk or another machine via the network). That is, entire DataBatch arrays may be written directly to the stream without performing fine-grained encoding, decoding, or processing of individual rows within the data batch, thereby improving the speed.

Further, in the columnar data batch organization, ColumnBatches of strings may be handled specially. To illustrate, a naïve creation of an array of strings may result in a large number of small string objects, which can be expensive and affect garbage collection and system performance. Instead, the system of the present disclosure may store all of the strings within the batch end-to-end in a single character array, with additional information on the starts and offsets of individual strings within the array. In some cases, string operations may be performed directly on the character array. Alternatively, in some cases, the system may copy over individual strings on-demand to a small fixed temporary buffer and perform the string operations on that buffer. This organization of strings may also allow for performing serialization and de-serialization of strings in a high-performance manner (e.g., using the bulk write mechanism described above).
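As an illustrative sketch of this layout (the names are illustrative only), all strings in a batch may be packed into one character array, with per-string start offsets and lengths:

    // Illustrative sketch: strings stored end-to-end in one char array.
    class StringColumnSketch
    {
        public char[] Chars;    // all characters of all strings, end-to-end
        public int[] Starts;    // start offset of string i within Chars
        public int[] Lengths;   // length of string i

        // Materialize string i on-demand (operations may also be
        // performed directly on Chars without this copy).
        public string Get(int i)
        {
            return new string(Chars, Starts[i], Lengths[i]);
        }
    }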

FIG. 3 further illustrates an example of a generated columnar batch type 316, in which the type of each field in a column-oriented definition is actually a more complicated type: ColumnBatch<T> (where T is the type of the associated field). In this example, the array of values (e.g., the array of values for the fields 304-312) is just one part of a ColumnBatch, which also stores control parameters.
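For the running example, a generated columnar batch type along these lines may be sketched as follows (illustrative only; the actual generated type may carry additional control parameters):

    // Illustrative sketch of ColumnBatch<T> and a generated columnar type.
    class ColumnBatch<T>
    {
        public T[] col;   // the array of values for one field
        // ... control parameters, such as a reference count (see the
        // memory management discussion below)
    }

    class BatchGeneratedForTitanicPassenger
    {
        public ColumnBatch<bool> survived;
        public ColumnBatch<int> pclass;
        public ColumnBatch<string> name;
        public ColumnBatch<string> sex;
        public ColumnBatch<int> age;
    }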

FIG. 4 illustrates an example of data organization within a DataBatch message with multiple columnar payloads and is generally designated 400. Note that, from a user's perspective, queries are still written with a row-oriented view of the data. FIG. 5, described below, illustrates how query processing occurs with columnar payloads.

FIG. 4 illustrates that the DataBatch message 116 may include N columnar payloads. For example, the columnar payloads may include a first columnar payload 412 (“Payload Column(1)”) through an Nth columnar payload 414 (“Payload Column(N)”). Referring to the previous example that includes the five fields 304-312, the first columnar payload 412 may include an array of “survived” values for the first field 304, while the Nth columnar payload 414 may correspond to a fifth columnar payload that includes an array of “age” values for the fifth field 312. While not explicitly shown in FIG. 4, it will be appreciated that a second columnar payload may include an array of “pclass” values for the second field 306, a third columnar payload may include an array of “name” values for the third field 308, and a fourth columnar payload may include an array of “sex” values for the fourth field 310.

While FIGS. 3 and 4 illustrate an example in which the message is a DataBatch message, the system of the present disclosure also supports control messages such as punctuations and completed messages.

Turning to the topic of memory management, one performance issue in stream processing systems is associated with the problem of fine-grained memory allocation and release (“garbage collection” or “GC”). Traditionally, these are very expensive operations, and the same is true in a .NET implementation or similar type of implementation. The present disclosure describes an approach to memory management that retains the advantages of the high-level world of .NET and yet provides the benefits of unmanaged page-level memory management. One advantage of not using unmanaged memory for memory management is avoiding the problem of handling complex .NET types.

In the system of the present disclosure, a memory pool (see e.g., the memory pools 122 of FIG. 1) may represent a reusable set of data structures (see e.g., the data structures 124 of FIG. 1). A new instance of a structure may be allocated by taking from the pool instead of allocating a new object (which can be very expensive). Likewise, when an object is no longer needed, the object may be returned to the memory pool instead of letting the garbage collection process reclaim the memory.

In the system of the present disclosure, there are two forms of pools. A data structure pool may hold arbitrary data structures such as Dictionary objects. For example, this type of pool may be used by operators that may need to frequently allocate and de-allocate such structures. The second type is a memory pool for events, which is associated with a DataBatch type, and contains a ColumnPool<T> for each column type T necessary for the batch. The pool itself may be generated. A ColumnPool<T> contains a pool (implemented as a queue) of free ColumnBatch entries.

ColumnBatch<T> instances are ref-counted, and each ColumnBatch<T> instance knows what pool it belongs to. When the RefCount for a ColumnBatch instance goes to zero, the instance is returned to the ColumnPool. When an operator needs a new ColumnBatch, the operator requests the ColumnPool (via the MemoryPool) for the same. The ColumnPool either returns a pre-existing ColumnBatch from the pool (if any) or allocates a new ColumnBatch.
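The pooling protocol may be sketched as follows (a minimal sketch with illustrative member names; the actual pool is generated per DataBatch type):

    // Illustrative sketch of a ref-counted column pool.
    using System.Collections.Generic;

    class ColumnPool<T>
    {
        private readonly Queue<ColumnBatch<T>> free = new Queue<ColumnBatch<T>>();
        private readonly int batchSize;

        public ColumnPool(int batchSize) { this.batchSize = batchSize; }

        // Return a pre-existing ColumnBatch if any; otherwise allocate one.
        public ColumnBatch<T> Get()
        {
            var cb = free.Count > 0 ? free.Dequeue() : new ColumnBatch<T>(batchSize, this);
            cb.RefCount = 1;
            return cb;
        }

        public void Return(ColumnBatch<T> cb) { free.Enqueue(cb); }
    }

    class ColumnBatch<T>
    {
        public T[] col;
        public int RefCount;
        private readonly ColumnPool<T> pool;  // each instance knows its pool

        public ColumnBatch(int size, ColumnPool<T> owner)
        {
            col = new T[size];
            pool = owner;
        }

        // When the ref-count reaches zero, the instance goes back to the
        // pool rather than to the garbage collector.
        public void Release()
        {
            if (--RefCount == 0) pool.Return(this);
        }
    }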

In a streaming system, the system may reach a “steady state” where all the necessary allocations have been performed. After this point, there may be very few new allocations occurring, as most of the time batches would be available in the pools. In some cases, there may be one memory pool allocated per operator. Further, in some cases, there may be a single global memory pool configuration. In some implementations, there may be one memory pool per core (or socket) serving all operators assigned to that core.

Returning to the idea of query compilation and processing, as discussed above, queries may be expressed against a row-oriented view of the data. In some implementations, the input data sources (e.g., the input data sources 102 of FIG. 1) may be modeled as instances of a type called IStreamable, that can for instance be created from existing data sources such as an IObservable. The application of each logical operator results in another IStreamable, allowing for the composition of larger query plans. Some operators such as “Join” receive two IStreamable instances as input and produce a single result IStreamable. The result of the query construction is therefore an IStreamable; note that at this point, no operators are generated and no query is executing.

Some implementations include binding the query to an execution by creating an observer (e.g., an input observer 114 as in FIG. 1) and subscribing to the query (e.g., the query 108 in FIG. 1) by calling Subscribe( ) on the IStreamable. This causes the compiler to walk the IStreamable DAG, allowing each logical operator to construct a physical operator instance, thereby generating the physical operator DAG (e.g., designated as 112 in FIG. 1). The data sources, on being subscribed to, may start pushing events through the physical operators, producing results that are eventually delivered to the result observers (e.g., the result observers 120 in FIG. 1).

As an illustrative, non-limiting example, consider a stream ps whose payload is of type TitanicPassenger (see e.g., the example payload data type 302 in FIG. 3). A user may have a logical view that the stream includes values of type “TitanicPassenger,” and the user may express a query as a method call (e.g., methods such as Where and Select). The argument to each method is a function, where the notation x=>e describes a function of one argument x. When the function is applied to a value (in this case of type TitanicPassenger), the result is the value e.

To illustrate, an example query about passengers on the Titanic may be represented as “ps.Where(p => p.survived && p.age > 30).” In this case, the method Where is a filter. That is, its value is a new stream including those records (tuples) of the original stream whose field survived had a value of true and whose field age was greater than 30. For example, referring to the example payload data type 302 in FIG. 3, the new stream would include those records in which the first field 304 (“survived”) has a value of true and the fifth field 312 (“age”) has a value that is greater than 30.

As another example, an example query about passengers on the Titanic may be represented as “ps.Select(p => new NewPassenger { NameSex = p.name + “,” + p.sex, Age = p.age }).” In this case, the method Select is a projection. That is, its value is a new stream including those records (tuples) of a new type, NewPassenger, distinct from TitanicPassenger, that has two fields. The first field is called NameSex and its value is a concatenation of the name field and the sex field of the corresponding record from the original stream. Likewise, the second field is called Age and its value is the value of the age field of the corresponding record. For example, referring to the example payload data type 302 in FIG. 3, the first field NameSex would have a value corresponding to a concatenation of the third field 308 (“name”) and the fourth field 310 (“sex”). The second field Age would have a value corresponding to the fifth field 312 (“age”).
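Putting the two together, such queries may be composed as ordinary method calls over the stream (an illustrative sketch; NewPassenger is a user-defined type as described above):

    // Illustrative composition of the two example queries over stream ps.
    var result = ps
        .Where(p => p.survived && p.age > 30)
        .Select(p => new NewPassenger { NameSex = p.name + "," + p.sex, Age = p.age });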

These example queries may be used to illustrate the different aspects of how the queries are compiled into a form that executes against the data organization described above. As described above with respect to FIG. 3, the data organization does not include the type TitanicPassenger. Instead, the type BatchGeneratedForTitanicPassenger is generated. As such, queries that, at the user level, involve the type TitanicPassenger use the generated type BatchGeneratedForTitanicPassenger instead. This non-uniformity means that, for efficient execution, the query submitted by the user is compiled into a custom operator. The custom operator is a code module which is loaded (and executed) dynamically in response to a user-formulated query.

Referring to FIG. 5, an example process of code-generation at query compile-time is illustrated and generally designated 500. It will be appreciated that the order of processing (serial/parallel) may differ from that shown in the example process flows in the present disclosure.

In some implementations, the present disclosure includes creating custom physical operators using a combination of an application programming interface (“API”) for expression trees (e.g., the .NET API for expression trees) together with meta-programming. An expression tree is an abstract syntax tree representing a fragment of code and uses the type Expression<T> to represent a piece of code of type T. The expression tree API may also contain functionality for transforming abstract syntax trees.

At 502, the process 500 includes receiving a user-formulated query. For example, the user-formulated query may be expressed as a “Select” method call. The type of the parameter for the method Select is Expression<Func<TPayload, U>>, where Func<A,B> is the type of a function that takes an argument of type A and returns a value of type B.

At 504, the process 500 includes generating a transformed expression tree. The transformation may include replacing all references to a field, f, of a value, r, to become references to the i-th row in the column corresponding to f. In some implementations, there may be further optimizations such as pointer swings (described further below) that are constant-time operations on a batch, instead of having to iterate over each row in a DataBatch.
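As an illustrative, hand-written sketch of this transformation for the running example (the generated code may differ; the types here are simplified versions of those sketched earlier):

    using System;
    using System.Linq.Expressions;

    // Simplified stand-ins for the user type and the generated batch type.
    class TitanicPassenger { public bool survived; public int age; }
    class ColumnBatch<T> { public T[] col; }
    class GeneratedBatch { public ColumnBatch<bool> survived; public ColumnBatch<int> age; }

    class TransformSketch
    {
        // The user writes a row-oriented predicate as an expression tree.
        static Expression<Func<TitanicPassenger, bool>> userPredicate =
            p => p.survived && p.age > 30;

        // The transformation rewrites each field access p.f into an access
        // to row i of the corresponding column, batch.f.col[i]; the engine
        // in-lines the result into the generated operator.
        static Func<GeneratedBatch, int, bool> transformedPredicate =
            (batch, i) => batch.survived.col[i] && batch.age.col[i] > 30;
    }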

FIG. 5 illustrates one example implementation of transforming a user query expressed using a row-oriented model to create executable code that accesses the column-oriented view of the data. In the particular example illustrated in FIG. 5, after generating the transformed expression tree, the process 500 includes generating a source file (e.g., a C# source file), at 506. In some cases, meta-programming capabilities may be used to create the source file, and the process 500 may include compiling the source file into a custom operator, at 508. The custom operator may include a code module which may be loaded and executed dynamically in response to the user-formulated query, as shown at 510. The transformed expressions may be in-lined into the operator.

In some cases, it may not be possible to generate a custom operator. For example, a query might require calling a method on an instance of TitanicPassenger. In such a situation, we may or may not be able (or want) to recreate an instance given the values for its fields. We then transform the data back to the row-oriented format and use the non-generated static (generic) definition of the operator (i.e., where each DataBatch has a column of Payload values).

Alternative methods may be used to transform code that is specified using the row-oriented model to the column-oriented model. For example, this may include performing a binary transformation at the machine code level (either at the IL level or even at a lower machine-specific layer). As another example, instead of generating a C# source file and then compiling the file (e.g., as shown in steps 506 and 508 in the example of FIG. 5), one or more application programming interfaces (APIs) may be used for directly emitting executable code (e.g., the Reflection.Emit APIs in .NET). Another option may include the use of a “rewriting framework” that takes as input the low-level machine code, decompiles it into a higher-level representation, and then performs the transformations at that level. The rewriting framework then has mechanisms for producing executable machine code from the transformed high-level representation. As another example, the system may restrict the user to a particular programming system that provides access to the associated source code (that is, rather than users providing executable code directly to the system) for use in source-level transformations.

The end result of this transformation is that we have, in a fully incremental streaming setting, the following capabilities: the ability to use a columnar data representation when possible; the flexibility of using complex types by reverting to non-generated row-oriented code when necessary; and the ability for users to write their queries using a single row-oriented view of the data.

In some implementations, the same meta-programming techniques may be used to generate specialized operators for transforming a row-oriented DataBatch into a column-oriented DataBatch (and vice-versa). This can occur either because some property of the user type prevents it from being used in column-oriented processing, because of some limitation in the operator, or because an expression in the query is too complex or opaque to allow its transformation. In any case, data usually enters and leaves the system as row-oriented and thus is to be transformed at the beginning of the processing pipeline into columnar format and then re-created in row format at the end of the pipeline.

The present disclosure also supports compile-time stream properties that define restrictions on the stream. For example, each IStreamable may carry with it a set of properties that propagate across operators in the logical plan (e.g., the logical query plan 104 in FIG. 1). When translating the query into its DAG of physical operators, the properties included with the IStreamable may be used to choose specific operator implementations that perform well for the given properties. For example, an interval-free stream may use a version of an “Aggregate” operator that does not need to remember information about future end-edges (endpoints of intervals).

As explained above, in the present disclosure all operators are SyncTime-ordered, and ingest/produce data in non-decreasing order of SyncTime. In cases where data arrives out-of-order, disorder may be removed (e.g., by buffering and reordering) before feeding the data to the engine.

As an illustrative example, with respect to Where (filtering), consider the following example code:

for (int i = 0; i < input.Count; i++)

    if (!BitVector[i]) { BitVector[i] = !predicate(input.Payload[i]); }

In this example, in the row-oriented mode, the filtering operator iterates over each row in the DataBatch, and if that row represents a valid tuple in the data, applies the user-supplied predicate to the payload. The BitVector is updated to record the value of the predicate. Note that predicate is a compiled form of the function that the user wrote.

The custom operator for Where in the example gets an input value, input, of type BatchGeneratedForTitanicPassenger. There are no values of type TitanicPassenger to apply the predicate to. Instead, the operator is made more efficient by inlining the body of the (transformed) predicate, which just accesses the columns corresponding to the fields survived and age:

for (int i = 0; i < input.Count; i++)

    if (!BitVector[i]) { BitVector[i] = !(input.survived[i] && input.age[i] > 30); }

With respect to Select (projection), as an illustrative example, consider the following example code:

for (int i = 0; i < input.Count; i++)

    if (!BitVector[i]) { output.NameSex[i] = input.name[i] + "," + input.sex[i]; }

As with the filter, the custom operator for Select is much more efficient because it accesses only the columns mentioned in the query. For instance, the value of the “NameSex” field is set by iterating over each row i in the batch. The assignment is done conditionally because the BitVector must be checked to determine whether the row is valid. However, there are opportunities for further optimization. For example, there is no assignment for the Age field within the loop, because the value of that field for every row in the output is exactly the value of the field for that row in the input. Since the fields are independent objects (of type ColumnBatch<int>), the same column can be shared between the input and output. This is referred to herein as a “pointer swing.” That is, rather than iterating over each row i, a single assignment (i.e., output.Age = input.Age;) may be performed once at the message (batch) level.

Referring now to FIG. 6, an example process associated with a snapshot operator producing incremental snapshot-oriented results is illustrated and generally designated 600.

The snapshot operator is a basic operator in the system of the present disclosure that produces incremental snapshot-oriented results. A set of algorithms for the snapshot operator may achieve high performance while adhering to the temporal semantics of the operation. The process 600 illustrated in FIG. 6 corresponds to a particular case in which a stream has arbitrary events. As such, intervals may present endpoints in arbitrary order. Therefore, a data structure may be employed such that items can be added in arbitrary order of time, but that allows retrieving in increasing timestamp order.

Recall that every stream is grouped, i.e., every event has a grouping key. The aggregate operator uses two data structures to store state related to the aggregation operation. The first data structure includes an “AggregateByKey” data structure that represents a hash table that stores, for every distinct key associated with non-empty state at the moment, an entry in the hash table with that key and the associated state. The second data structure includes a “HeldAggregates” data structure. This is a hash table called “FastDictionary” (described further below) that stores, for the current timestamp T, all the partial aggregated state corresponding to keys for which events arrived with sync time equal to T. This hash table is emptied out whenever time moves forward. It does not require the ability to delete individual keys, but requires the ability to iterate through all entries very fast, and to clear the hash table very fast when time moves to the next value.

Referring to FIG. 6, the snapshot operator operates as follows. At 602, as events with the same sync time are received, the process 600 includes building up the current set of partial aggregates in the HeldAggregates data structure. As shown at 604, when time moves forward, the process 600 includes issuing start-edges for these partial aggregates (if not empty) and folding these partial aggregates into the larger set stored in the AggregateByKey data structure.

At 606, the process 600 includes removing entries in the AggregateByKey data structure that are now empty in order to maintain “empty-preserving semantics.” The process 600 further includes issuing end-edges as needed for prior aggregates, at 608, and clearing the HeldAggregates data structure for reuse during the next timestamp.

While not illustrated in FIG. 6, the process 600 may also include maintaining an endpoint compensation queue (or “ECQ”) that contains, for each future endpoint, partially aggregated state for that endpoint. Whenever time moves forward, the endpoints between the previous and the new timestamp may be processed from the ECQ. The aggregate operator implementation of the present disclosure may also cache the state associated with the currently active key, so that the common case where all events have the same key can be executed very efficiently without hash lookups.

Further, in an alternative case, the stream has only start and end edges. As such, an ECQ is not needed to store endpoints. The algorithm therefore becomes simpler with less code and thus is more efficient. In another alternative case, the stream has start edges, end edges, and/or intervals of constant duration. As such, the intervals provide information for future points, but the information is presented in non-decreasing order. This implies that we can store the ECQ as a FIFO queue that adds items at one end and removes items from the other end.

As described above, the HeldAggregates data structure is a hash table that may be referred to as a “FastDictionary.” The FastDictionary may represent a lightweight .NET dictionary that is optimized for frequent lookups, small sets of keys, frequent clearing of the entire data structure, and frequent iteration over all keys in the table. The FastDictionary attempts to minimize memory allocations and hash computations during runtime.

FastDictionary uses open addressing with sequential linear probing. The basic data structure is a prime-number-sized array A of <key, value> pairs. An entry to be looked up or inserted is hashed to an index in A. If that entry is occupied, FastDictionary may scan entries in A sequentially until the element is located (or an open slot is reached, which indicates lookup failure). The sequential probing is well suited to CPU caching behavior, and with a suitably low load factor (1/16 to 1/8), there is a high likelihood of finding an element very quickly. We resize the hash table when necessary to maintain the low load factor.

The array A is augmented with a bitvector B, which has one bit per array element to indicate whether that entry is used. B allows iteration to be performed very efficiently, and insertion can find an empty slot index without having to access A. Further, clearing the dictionary is straightforward: we simply zero out the bitvector. Note that this means that the GC can free older values only when entries are reused later, but this is usually not an issue. Accesses to the bitvector are very fast due to cache locality. In some implementations, FastDictionary may perform up to 40% better than a standard .NET Dictionary for streaming workloads, when used inside the aggregation operator to maintain partial aggregation states for the current timestamp.
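The lookup and clear paths may be sketched as follows (a minimal sketch with illustrative member names; the actual implementation stores the occupancy bits compactly and also supports insertion and resizing):

    using System;
    using System.Collections.Generic;

    // Illustrative sketch of FastDictionary's lookup and clear paths.
    class FastDictionarySketch<TKey, TValue>
    {
        private KeyValuePair<TKey, TValue>[] A;  // prime-number-sized slot array
        private bool[] B;                        // one "used" flag per slot

        public FastDictionarySketch(int primeSize)
        {
            A = new KeyValuePair<TKey, TValue>[primeSize];
            B = new bool[primeSize];
        }

        public bool TryLookup(TKey key, int hash, out TValue value)
        {
            int i = (hash & 0x7fffffff) % A.Length;
            while (B[i])                                   // probe sequentially
            {
                if (A[i].Key.Equals(key)) { value = A[i].Value; return true; }
                i = (i + 1) % A.Length;
            }
            value = default(TValue);                       // open slot: lookup failure
            return false;
        }

        // Clearing only zeroes the occupancy bits; slot contents are
        // overwritten lazily as entries are reused.
        public void Clear() { Array.Clear(B, 0, B.Length); }
    }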

Referring now to FIGS. 7 and 8, example processes associated with a temporal join operation that operates on input data events in SyncTime order across two inputs are illustrated and generally designated 700 and 800.

The join operation is implemented as an equijoin with delegates provided to determine a mapping key for each payload. The join key is the same as the grouping key associated with the stream. When a payload from the two inputs has the same key and overlaps temporally, the join executes another delegate to generate output for the pair of payloads. Outer join is implemented as an equijoin where all payloads have the same mapping key. In the present disclosure, we describe optimizing the join operation for two special cases. In the first case, illustrated in FIG. 7, the input has only start edges and events will not be removed from the synopsis, while in the second case, illustrated in FIG. 8, the input has arbitrary events.

Referring to FIG. 7, the process 700 includes processing all input events in sync-time order, at 702. The processing of an input event includes inserting the input event into a map data structure, with a key equal to the mapping key of the input event and a value equal to the payload of the input event. At 704, the process 700 includes searching the map data structure of the alternate input to identify events that were received with the same mapping key from the alternate input. Any payloads that are identified as having the same mapping key correspond to a successful join for which the operator will generate output, at 706.

Referring to FIG. 8, the process 800 illustrates the second case in which the input has arbitrary events. When the input events can include intervals and end edges, then the operator handles the case of events being removed. The operator still employs two map data structures similar to the start edge case, but also employs an endpoint compensation queue to remove intervals once time progresses to their ending timestamp, as shown at 802. Additionally, the operator cannot determine which newly processed payloads successfully join start edges from the alternate input until time progresses beyond the current timestamp (because any active start edges could always be ended by an end edge). Note that this delay in output generation is only for the case of an event joining against a start edge, because intervals have known ending timestamps. When events are removed from the map, either by end edges or the reaching of an interval's ending timestamp, the operator performs a search of the alternate input's map to identify payloads which joined previously, as illustrated at 804. For each of those payloads, the operator outputs a corresponding end edge, at 806. While not illustrated in FIG. 8, the case of an interval joining with an interval is handled specially because the exact duration of the join is known beforehand, so the operator outputs an interval for the corresponding duration.

Further, while not illustrated in FIG. 7 or 8, there are specialized versions of “Join” to handle two cases. In the first case, the inputs are asymmetric, and the join operation is referred to as an “Asymmetric Join.” That is, the right side is much smaller/lower throughput than the left side. In the second case, the inputs are sorted by the join key or its superset, and the join operation is referred to as a “Sort-Order-Aware Join.” These variants of Join are discussed further below.

In the present disclosure, a “WhereNotExists” operator and a “Clip” operator represent “anti-joins” that output only those events received on their left input that do not join with an event received on the right. Similar to join, the user provides delegates to determine a mapping key for each payload. Clip is a restricted form of WhereNotExists that is optimized for the common case of permanently clipping an event received on the left when a future right event successfully joins with it. The implementations for the two operators may be optimized differently.

For the “WhereNotExists” operation, all input events received on the left input are processed, in sync-time order, and inserted into a map data structure with the key and value equal to the mapping key and payload, respectively, of the input event. Similarly, input events received on the right input are processed in sync-time order and inserted into a data structure that counts the number of occurrences of a mapping key on the right input. Any start edge event received on the right input, for which it is the first occurrence of that key, results in a scan for any joining left inputs which require the output of an end edge. Similarly, any end edge event received on the right input, for which the resulting occurrence count drops to zero, results in a scan for any joining left inputs which now require the output of a start edge. Only when time progresses on the right input is a scan performed to search for any newly inserted left events that do not join with any event on the right, to output an initial start edge.

The “Clip” operation is similar to, but a much more optimized version of, WhereNotExists. In Clip, only events received on the right at a later timestamp can join to events received on the left. As a result, no right state may be maintained. Instead, only a map of events received on the left input is required. As events are received on the left, the operator outputs a corresponding start edge and inserts the event into a map. As events are received on the right, the operator performs a scan in the map to locate any joining left events. All joining left events are removed from the map, and an end edge is output for each.
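The left/right processing of Clip may be sketched as follows (a minimal sketch, assuming hypothetical helper names such as EmitStartEdge and EmitEndEdge; batching and grouping machinery is omitted):

    using System.Collections.Generic;

    // Illustrative sketch of Clip: only left-side state is kept, and a
    // right event permanently clips all matching left events.
    class ClipSketch<TKey, TPayload>
    {
        private readonly Dictionary<TKey, List<(long start, TPayload payload)>> leftMap =
            new Dictionary<TKey, List<(long start, TPayload payload)>>();

        public void OnLeft(long syncTime, TKey key, TPayload payload)
        {
            EmitStartEdge(syncTime, key, payload);
            if (!leftMap.TryGetValue(key, out var list))
                leftMap[key] = list = new List<(long start, TPayload payload)>();
            list.Add((syncTime, payload));
        }

        public void OnRight(long syncTime, TKey key)
        {
            if (!leftMap.TryGetValue(key, out var list)) return;
            foreach (var (start, payload) in list)
                EmitEndEdge(syncTime, start, key, payload);  // OtherTime = original start
            leftMap.Remove(key);                             // clip permanently
        }

        void EmitStartEdge(long t, TKey k, TPayload p) { /* append to output batch */ }
        void EmitEndEdge(long t, long start, TKey k, TPayload p) { /* append to output batch */ }
    }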

In the present disclosure, Join, WhereNotExists and Clip may be scaled out by writing them as GroupApply operations, as described further below. The GroupApply operation sets the key of the stream to the join key. The above operators assume that the key of the stream is the equijoin attribute; thus, join works efficiently and interoperates correctly in the context of GroupApply.

Referring now to the concept of punctuations and the passage of time, data events also serve as implicit punctuations until the sync time of the event because of the property that all streams are in non-decreasing sync time order. The streaming engine of the present disclosure retains an explicit notion of punctuations, but for two distinct reasons. First, punctuations signify the passage of time during periods of lulls (i.e., no data). Note that even if the input stream has data events, there may be an operator such as Where that filters out most (or all) events, leading to lulls in its output stream. Second, the streaming engine tries to maximize performance by ingressing data in batches. Likewise, operators write output data events to a temporary holding batch, and wait before propagating the batch to the next downstream operator until either the batch is full or a punctuation or a completed control event is received. Punctuations thus serve to “kick the system” and force it to generate output regardless of the current state of operators in the query plan. In this way, punctuations in Trill allow users to control the throughput/latency tradeoff by controlling how aggressively data is batched, and when data is forced to egress out from the system.

Punctuations and the data batch maximum size control the throughput/latency tradeoff. Operators try to fill up a batch before sending it to the next operator, thereby introducing latency. Users may introduce frequent punctuations in order to force the engine to output batches early, thereby reducing latency. However, a very frequent punctuation strategy may result in many batches being almost empty, which could result in a significant waste of memory.

A simple first step towards solving this problem is to allow users to control the batch size using a global configuration setting. This allows us to avoid some really bad cases, but does not cover the cases where different operators see different punctuation frequencies due to query and data semantics. For example, an input may have a punctuation every 1000 events, but the result of a filter with selectivity 0.2 would have a punctuation every 200 events. The punctuation frequency may also be data dependent and change with time (e.g., how many tuples join with a given input in Join). Thus, we propose a dynamically adaptive batch-size setting algorithm that works as follows. Let DefaultBatchSize denote the “ideal” batch size in the absence of punctuations, where performance does not improve significantly (on expectation) when the batch size is increased further (80,000 by default). Let StartBatchSize be the smallest batch size that gives reasonable performance on expectation, such that a lower batch size degrades performance significantly (250 by default).

Referring now to FIG. 9, an example process associated with a dynamic batch-size setting algorithm is illustrated and generally designated 900. The process 900 includes, for each operator, starting the batch size at MaxBatchSize=StartBatchSize, at 902. At 904, the process 900 includes observing the first k punctuations, or waiting until the first DefaultBatchSize events are observed.

In the illustrative example of FIG. 9, the process 900 includes setting the batch size to MaxBatchSize=MIN(max. batch fill*1.25, DefaultBatchSize), at 906. Further, the process 900 includes maintaining a running average of observed actual outgoing batch sizes, at 908, and adjusting the MaxBatchSize periodically, at 910.
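The adaptation rule may be sketched as follows (illustrative only; k, the observation window, and the adjustment policy are parameters of a particular implementation):

    using System;

    // Illustrative sketch of the adaptive batch-size rule of FIG. 9.
    class BatchSizeSketch
    {
        const int StartBatchSize = 250;      // default per the text
        const int DefaultBatchSize = 80000;  // default per the text

        int maxBatchSize = StartBatchSize;   // step 902

        // Called after the first k punctuations (or after DefaultBatchSize
        // events) have been observed: steps 904-906.
        void OnInitialObservation(int maxObservedBatchFill)
        {
            maxBatchSize = Math.Min((int)(maxObservedBatchFill * 1.25), DefaultBatchSize);
        }

        // Called periodically with a running average of outgoing batch
        // sizes: steps 908-910.
        void OnPeriodicAdjustment(double runningAverageFill)
        {
            maxBatchSize = Math.Min((int)(runningAverageFill * 1.25), DefaultBatchSize);
        }
    }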

The memory pools (per operator) are also modified to handle varying batch sizes by allowing us to return batches of the older size to the GC instead of returning them to the pool. As a result, the memory pool at any given moment only contains batches of a fixed size, which may adjust periodically.

The system of the present disclosure also supports the notion of grouped sub-queries. The basic idea is that given an input stream, the user provides a grouping key and a subquery to be executed for every distinct value of the key. Logically, the sub-query is executed on every substream consisting of events with the same distinct value of the key. The GroupApply operator in Trill executes such a query very efficiently on a multi-core machine. We now describe the novel two-stage architecture of GroupApply that makes this possible.

Referring now to FIG. 10, an example process associated with a grouping of sub-queries is illustrated and generally designated 1000. The present disclosure includes logically modeling the GroupApply operator using a two-stage streaming GroupApply architecture. The operation has multiple stages, which we describe next. Overall, there are NumBranchesL1 copies of the map/shuffle stage and NumBranchesL2 copies of the apply/reduce stage.

The process 1000 includes, at 1002, performing a “Spray/Map” operation that takes a stream of batches and performs a stateless spray of the batches to multiple cores (equal to NumBranchesL1) via blocking concurrent queues. This operator performs a constant amount of work per batch and hence introduces negligible overhead to the system.

At 1004, the process 1000 includes performing a “Map/Group/Shuffle” operation. This operator resides on each of the NumBranchesL1 cores and receives events in a round-robin fashion from the sprayers. The shuffle maintains NumBranchesL2 output batches. For each incoming batch, the operator applies the map sub-query and then computes the grouping key and its associated hash value on the resulting stream. Based on the hash value, the operator stores the tuple in one of the NumBranchesL2 partial output batches. When an output batch fills up, the batch is delivered to the corresponding “Apply” branch via a lock-free blocking concurrent queue.
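
A sketch of one shuffle branch under stated assumptions: any queue-like object with blocking put/get stands in for the lock-free blocking concurrent queues described above, and the function and parameter names are invented for the example.

```python
def shuffle_branch(in_queue, out_queues, map_subquery, key_fn, hash_fn,
                   max_batch):
    num_l2 = len(out_queues)
    partial = [[] for _ in range(num_l2)]       # NumBranchesL2 output batches
    for batch in iter(in_queue.get, None):      # None acts as end-of-stream
        for tup in map_subquery(batch):         # apply the map sub-query
            i = hash_fn(key_fn(tup)) % num_l2   # route by grouping-key hash
            partial[i].append(tup)
            if len(partial[i]) >= max_batch:    # batch full: deliver it
                out_queues[i].put(partial[i])
                partial[i] = []
    for i, batch in enumerate(partial):         # flush remainders at the end
        if batch:
            out_queues[i].put(batch)
```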

At 1006, the process 1000 includes performing a “Merge/Apply/Ungroup” (streaming reduce) operation. At each of the NumBranchesL2 cores, the operator first performs a union of data from the NumBranchesL1 upstream inputs. The union is a temporal union (maintaining timestamp order) in order to retain the invariant that all streams are in strict sync-time order. This is followed by an application of the reduce sub-query on each of the cores, and then by an ungroup operation that unpeels the grouping key (nesting is allowed). The results are supplied to a “Final Merge” operator to perform a final merge operation, at 1008.

In the final merge operation, the result batches from the reduce operations are merged into a single output stream in temporal order. This operation is performed by the same merger described above at 1006 that merges multiple streams into one. Merging may include the use of a sequence of cascading binary merges in order to scale out the merge across multiple cores (each binary merge can potentially reside on a different core). Further, the use of binary merges may provide significant performance benefits over using a priority queue to perform a single n-way merge. In some implementations, three cores are used to perform the merge: one for the root, and two more for the left and right sub-trees, respectively, although other configurations are also possible.
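
The following sketch illustrates cascading binary merges over sync-time-ordered streams; heapq.merge serves as the two-way temporal merge primitive, the sync-time accessor is an assumption, and the placement of each binary merge on its own core is elided.

```python
import heapq

def temporal_merge(left, right, sync_time=lambda e: e[0]):
    # Two-way merge that preserves non-decreasing sync-time order.
    return heapq.merge(left, right, key=sync_time)

def cascading_merge(streams, sync_time=lambda e: e[0]):
    # Repeatedly pair streams off into binary merges until one remains,
    # forming the cascading merge tree described above.
    while len(streams) > 1:
        paired = []
        for i in range(0, len(streams) - 1, 2):
            paired.append(temporal_merge(streams[i], streams[i + 1], sync_time))
        if len(streams) % 2:
            paired.append(streams[-1])   # odd stream passes through this level
        streams = paired
    return streams[0]
```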

The system of the present disclosure also supports an apply branch with two inputs. The basic idea and architecture are similar to the description above, except that there are two separate map phases, one for each of the inputs, and these map outputs are shuffled and brought together to a single set of two-input reducers. The details are omitted for brevity.

With respect to sorted data and exploiting sort orders, when performing offline relational queries (including progressive queries), the system of the present disclosure provides the ability to pre-sort the data by some key in order to optimize query processing. The system of the present disclosure supports a compile-time property to identify whether the data is snapshot-sorted (i.e., whether each snapshot is sorted by some payload key). In the case of progressive data, snapshot-sorted implies a global sort order for the data.

The system of the present disclosure also supports the notion of sort-order-aware data packing. The basic idea here is that sorted data is packed into batches according to the rule that, for a given batch B, data with a given sort key value K cannot spill to the next batch B+1 unless all the data in batch B has the same sort key value K.
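
A sketch of this packing rule, assuming the input rows arrive sorted by the key and illustrative names: a key run is split across batches only when the run itself overflows a batch, in which case every emitted overflow batch is homogeneous in that key, satisfying the rule above.

```python
from itertools import groupby

def pack_sorted(rows, key_fn, max_batch):
    batch = []
    # Group consecutive rows with the same key into runs, then pack runs.
    for _, run_iter in groupby(rows, key=key_fn):
        run = list(run_iter)
        if batch and len(batch) + len(run) > max_batch:
            yield batch                  # cut the batch at a key boundary
            batch = []
        batch.extend(run)
        while len(batch) > max_batch:
            # A single key's run exceeds a batch: it may spill, because every
            # row in each emitted batch carries that same key value K.
            yield batch[:max_batch]
            batch = batch[max_batch:]
    if batch:
        yield batch
```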

The spray phase (see, e.g., FIG. 10, at 1002) of GroupApply can exploit this packing scheme to retain the sort order during spray. Basically, it retains the last key in the current batch B before spraying the batch to core i. In case the first element in the next batch B+1 has the same key value, that batch is also sprayed to the same core i. Otherwise, the batch B+1 is sprayed as usual to the next core i+1 (modulo the number of cores). This choice allows the sort ordering and packing property to be retained within each downstream branch.
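
A sketch of this spray routing rule, with queues standing in for cores and all names illustrative.

```python
def sorted_spray(batches, cores, key_fn):
    i = 0
    last_key = None
    for batch in batches:
        # If the first key of B+1 differs from the last key of B, advance
        # round-robin; otherwise keep B+1 on the same core as B.
        if last_key is not None and key_fn(batch[0]) != last_key:
            i = (i + 1) % len(cores)
        cores[i].put(batch)
        last_key = key_fn(batch[-1])     # remember B's last key for B+1
```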

Further, if the GroupApply key happens to be equal to, or a subset of, the sorting key, we move the apply sub-query into the spray phase, thereby completely avoiding the shuffle.

With respect to Join operations, the system of the present disclosure supports a variant of join called the “AsymmetricJoin.” The basic idea is that if the left side of the join is much smaller than the right side, we multicast the left side to all the cores and simply spray the right side round-robin across the join operator instances. This allows us to completely avoid the shuffle phase of the map-reduce, at the expense of keeping a duplicate copy of the (smaller) left side at all cores. The user may choose this variant of Join by specifying an optional parameter to the Join operator. The asymmetric join operation is supported by an asymmetric two-input reduce phase that multicasts the smaller side and sprays the larger side across the reducer instances.
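
A sketch of the AsymmetricJoin routing, with queues standing in for the per-core join instances; tagging batches as left or right is an assumption made for the example.

```python
def asymmetric_join_spray(left_batches, right_batches, join_queues):
    n = len(join_queues)
    # Multicast the (small) left side to every join instance, so no
    # shuffle is needed to co-locate matching keys.
    for batch in left_batches:
        for q in join_queues:
            q.put(("left", batch))
    # Spray the (large) right side round-robin across the instances.
    for i, batch in enumerate(right_batches):
        join_queues[i % n].put(("right", batch))
```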

The system of the present disclosure supports another variant of join called a “Sort-Order-Aware Join.” In case the data is sorted by the same key as the join key (or a superset thereof), a more efficient version of join may be used that does not perform any hashing of tuples. This is similar to the merge join operator in databases. The system automatically chooses between the traditional hash join and the merge join based on compile-time stream properties.
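
A sketch of the hash-free join over key-sorted inputs, which is essentially a textbook merge join; it operates on in-memory lists for clarity rather than on the batched streams described above, and the names are illustrative.

```python
def merge_join(left, right, key_fn):
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key_fn(left[i]), key_fn(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Equal keys: emit the cross product of the two equal-key runs.
            i0 = i
            while i < len(left) and key_fn(left[i]) == lk:
                i += 1
            j0 = j
            while j < len(right) and key_fn(right[j]) == lk:
                j += 1
            for l in left[i0:i]:
                for r in right[j0:j]:
                    yield (l, r)
```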

The system of the present disclosure also supports “Schedulers.” In one implementation, the GroupApply operator is used as the building block for scaling out to multiple processor cores. Further, in some cases, a more general scheduler mechanism may allow more effective sharing of resources across multiple queries and operators. The basic idea is to create and assign a scheduler per core, and to allocate operators across these schedulers. The scheduler logic operates purely at batch boundaries, thus avoiding a performance penalty. Communication across schedulers (in case a downstream operator is assigned to a different scheduler) is accomplished using efficient lock-free queues of batched messages.

Example Computing Device and Environment

FIG. 11 illustrates an example configuration of a computing device 1100 and an environment that can be used to implement the modules and functions described herein.

The computing device 1100 may include at least one processor 1102, a memory 1104, communication interfaces 1106, a display device 1108 (e.g., a touchscreen display), other input/output (I/O) devices 1110 (e.g., a touchscreen display or a mouse and keyboard), and one or more mass storage devices 1112, able to communicate with each other, such as via a system bus 1114 or other suitable connection.

The processor 1102 may be a single processing unit or a number of processing units, all of which may include single or multiple computing units or multiple cores. The processor 1102 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 1102 can be configured to fetch and execute computer-readable instructions stored in the memory 1104, mass storage devices 1112, or other computer-readable media.

Memory 1104 and mass storage devices 1112 are examples of computer storage media for storing instructions that are executed by the processor 1102 to perform the various functions described above. For example, memory 1104 may generally include both volatile memory and non-volatile memory (e.g., RAM, ROM, or the like). Further, mass storage devices 1112 may generally include hard disk drives, solid-state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), a storage array, network attached storage, a storage area network, or the like. Both memory 1104 and mass storage devices 1112 may be collectively referred to as memory or computer storage media herein, and may be computer-readable media capable of storing computer-readable, processor-executable program instructions as computer program code that can be executed by the processor 1102 as a particular machine configured for carrying out the operations and functions described in the implementations herein.

The computing device 1100 may also include one or more communication interfaces 1106 for exchanging data with other devices, such as via a network, direct connection, or the like, as discussed above. The communication interfaces 1106 can facilitate communications within a wide variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and the like. Communication interfaces 1106 can also provide communication with external storage (not shown), such as in a storage array, network attached storage, storage area network, or the like.

The discussion herein refers to data being sent and received by particular components or modules. This is not to be taken as a limitation: such communication need not be direct, and the particular components or modules need not necessarily be a single functional unit. The signals could instead be relayed by a separate component upon receipt of the data. Further, the components may be combined, or the functionality may be separated amongst components, in various manners not limited to those discussed above. Other variations in the logical and practical structure and framework of various implementations would be apparent to one of ordinary skill in the art in view of the disclosure provided herein.

A display device 1108, such as a touchscreen display or other display device, may be included in some implementations. Other I/O devices 1110 may be devices that receive various inputs from a user and provide various outputs to the user, and may include a touchscreen, such as a touchscreen display, a keyboard, a remote controller, a mouse, a printer, audio input/output devices, and so forth.

Memory 1104 may include modules and components for execution by the computing device 1100 according to the implementations discussed herein. In the illustrated example, memory 1104 includes a query processing engine 1114 as described above. Memory 1104 may further include one or more other modules 1116, such as an operating system, drivers, application software, communication software, or the like. Memory 1104 may also include other data 1118, such as data stored while performing the functions described above and data used by the other modules 1116. Memory 1104 may also include other data and data structures described or alluded to herein.

The example systems and computing devices described herein are merely examples suitable for some implementations and are not intended to suggest any limitation as to the scope of use or functionality of the environments, architectures, and frameworks that can implement the processes, components, and features described herein. Thus, implementations herein are operational with numerous environments or architectures, and may be implemented in general-purpose and special-purpose computing systems, or other devices having processing capability. Generally, any of the functions described with reference to the figures can be implemented using software, hardware (e.g., fixed logic circuitry), or a combination of these implementations. The term “module,” “mechanism,” or “component” as used herein generally represents software, hardware, or a combination of software and hardware that can be configured to implement prescribed functions. For instance, in the case of a software implementation, the term “module,” “mechanism,” or “component” can represent program code (and/or declarative-type instructions) that performs specified tasks or operations when executed on a processing device or devices (e.g., CPUs or processors). The program code can be stored in one or more computer-readable memory devices or other computer storage devices. Thus, the processes, components, and modules described herein may be implemented by a computer program product.

Although illustrated in FIG. 11 as being stored in memory 1104 of computing device 1100, the query processing engine 1114, or portions thereof, may be implemented using any form of computer-readable media that is accessible by computing device 1100. As used herein, “computer-readable media” includes at least two types of computer-readable media, namely computer storage media and communications media.

Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.

In contrast, communication media may embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave, or other transmission mechanism. As defined herein, computer storage media does not include communication media.

Thus, the present disclosure describes a high-performance lightweight stream processing engine that may execute temporal streaming queries at speeds between 10 and 1000 times faster than existing engines. This level of performance may result from a bottom-up emergent system design, along with techniques such as aggressive physical batching, columnar data storage with row-oriented data access, code generation for highly optimized operations, a carefully restricted physical model with new sync-time-ordered operator algorithms, and a careful separation of fine-grained work performed with each event from coarse-grained work performed at boundaries of groups (or batches) of events. Thus, the processing engine of the present disclosure may represent a “one-size-fits-many” engine that may be effectively and efficiently used for real-time and offline temporal queries, as well as relational queries and progressive relational queries.

Some examples include a batched sequential physical data organization for a streaming engine (without affecting result semantics), along with an emergent engine design pattern that carefully separates fine-grained work from coarse-grained work (at batch boundaries) during query processing. In the present disclosure, incoming data may be carefully organized into batches; the batches may flow through the graph of stream operators, with each operator building and outputting batches and results emitted in batches; well-defined logical semantics may ensure that batching does not affect result content or correctness, but only, potentially, actual latency; and the system may be architected to perform expensive work at batch boundaries while carefully executing fine-grained intra-batch work inside operators in tight loops.

In some examples, a constrained sync-time-ordered physical data model may be used along with associated high-performance algorithms for streaming temporal-relational operations such as snapshot aggregation, join, and set difference. That is, the streaming engine of the present disclosure may restrict the nature of the data that is present in the batches so that operators can use more efficient algorithms, along with a set of operator algorithms and data structures that exploit these physical restrictions.

In some examples, the present disclosure describes the re-interpretation of punctuations to signify the passage of time as well as to force output generation in a batched sequential streaming engine, and the introduction of adaptive batch sizing to exploit the associated latency/throughput tradeoff. That is, the present disclosure describes extending the notion of punctuations, a concept that is common in stream systems, to force output generation and flushing of partially filled batches, in addition to denoting the passage of time as in traditional punctuations. Further, this new definition of punctuations may be leveraged to provide users with an ability to exploit the tradeoff between throughput (more throughput with larger batches) and latency (lower latency with smaller batches). Further, the adaptive setting of maximum batch sizes may be exploited to better control the memory and CPU overhead of batching.

In some implementations, bitvector filtering and columnar data organization may be used in the streaming engine as techniques to minimize data movement between the CPU and main memory and to improve throughput by maximizing the use of main memory bandwidth. That is, the present disclosure may utilize the design principle of architecting a streaming engine with the goal of minimizing data movement between main memory and caches, along with specific techniques to enable this design, such as bitvector filtering using a bitvector within each batch, and a columnar organization of control and payload fields within a batch.

Some examples include techniques to automatically and transparently convert row-oriented data access and query specification in a high-level language to column-oriented inlined data access inside a streaming engine (with the above-mentioned physical data organization) using dynamic code generation and compilation. Further, the system of the present disclosure provides the ability to transparently fall back to row-oriented execution in the case of complex types or complex user expressions over the types in queries, in the context of a query language that takes arbitrary row-valued functions as user input. Further, the system of the present disclosure provides the ability to transition from row-oriented to column-oriented data organization (and vice versa) within a query plan in the context of the above streaming engine design, in a query-dependent manner.

Some examples include techniques and arrangements for a two-stage scale-out architecture for scaling out grouped sub-queries (e.g., “GroupApply”) with potentially more than one input across multiple processor cores, along with the use of compile-time properties and algorithms to exploit asymmetry and sort orders in the input streams.

Furthermore, this disclosure provides various example implementations, as described and as illustrated in the drawings. However, this disclosure is not limited to the implementations described and illustrated herein, but can extend to other implementations, as would be known or as would become known to those skilled in the art. Reference in the specification to “one implementation,” “this implementation,” “these implementations,” or “some implementations” means that a particular feature, structure, or characteristic described is included in at least one implementation, and the appearances of these phrases in various places in the specification are not necessarily all referring to the same implementation.

CONCLUSION

Although the subject matter has been described in language specific to structural features and/or methodological acts, the subject matter defined in the appended claims is not limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims. This disclosure is intended to cover any and all adaptations or variations of the disclosed implementations, and the following claims should not be construed to be limited to the specific implementations disclosed in the specification. Instead, the scope of this document is to be determined entirely by the following claims, along with the full range of equivalents to which such claims are entitled.

1. A method comprising: under control of one or more computing devices: receiving a stream of incoming data events; annotating individual ones of incoming data events with a first timestamp and a second timestamp, wherein: the first timestamp identifies when a particular data event in the stream of incoming data events is received; and the second timestamp identifies additional information associated with the particular data event; organizing the stream of incoming data events into a sequence of data batches, wherein individual data batches include multiple data events; and processing the individual data batches in the sequence in a non-decreasing time order that is based on the first timestamp.
2. The method of claim 1, wherein the individual data batches store a payload array that includes an array of all payloads within the particular data batch.
3. The method of claim 2, wherein the payloads are arranged in a columnar format.
4. The method of claim 1, wherein the individual data batches store control parameters in a columnar format.
5. The method of claim 4, wherein the control parameters stored in the columnar format include: a synctime array that includes an array of synctime values of all events in the batch; an othertime array that includes an array of othertime values of all events in the batch; a bitvector that includes an occupancy vector representing an array with one bit; a key array that includes at least one array of grouping key values; and a hash array that includes an array of hash values.
6. The method of claim 1, wherein each incoming data event is not annotated with a third timestamp.
7. The method of claim 1, wherein: the incoming data event includes an interval event; the first timestamp includes a start time of the interval event; and the additional information identified by the second timestamp includes a known future time that the interval event is to end.
8. The method of claim 1, wherein the incoming data event includes a start-edge event and an end-edge event.
9. A computing system comprising: one or more processors; one or more computer-readable media maintaining instructions that, when executed by the one or more processors, cause the one or more processors to perform acts comprising: receiving a stream of incoming data events; annotating individual ones of incoming data events with a first timestamp and a second timestamp, wherein: the first timestamp identifies when a particular data event in the stream of incoming data events is received; and the second timestamp identifies additional information associated with the particular data event; organizing the stream of incoming data events into a sequence of data batches, wherein individual data batches include multiple data events; and processing the individual data batches in the sequence in a non-decreasing time order that is based on the first timestamp.
10. The computing system of claim 9, the acts further comprising adding a punctuation to a particular event, wherein: the punctuation is used to mark a passage of time when no incoming data is detected over a period of time; or the punctuation is used to close out a partially filled data batch to produce an output for a downstream operator.
11. The computing system of claim 9, wherein: the individual data batches store control parameters in individual arrays that are arranged in a columnar format; the individual data batches store a payload array that includes an array of all payloads within the particular data batch arranged in a columnar format; and processing the individual data batches in the sequence includes processing entire uninterrupted arrays without performing per-row encoding or per-row decoding.
12. The computing system of claim 9, wherein: a particular data batch stores control parameters in individual arrays that are arranged in a columnar format; the particular data batch stores a payload array that includes an array of all payloads within the particular data batch arranged in a columnar format; when a particular payload is of a string type, individual strings are stored within the particular data batch end-to-end in a single character array with additional information associated with starts and offsets of the individual strings; and processing the particular data batch includes: performing string operations directly on the single character array; or copying individual strings to a buffer and performing string operations on the buffer.
13. The computing system of claim 9, wherein: a particular stream is logically grouped by a key that logically represents multiple distinct sub-streams; and each distinct sub-stream is associated with a distinct value of a grouping key.
14. The computing system of claim 13, wherein a single timestamp domain is associated with all groups such that passage of time occurs across all groups and not on a per-group basis.
15. The computing system of claim 13, wherein the grouping key and a hash value associated with the grouping key are stored as part of a data event.
16. The computing system of claim 9, the acts further comprising: receiving a query that corresponds to a row-oriented view of data; in response to receiving the query, dynamically generating custom code corresponding to a columnar representation of the data; incorporating the dynamically generated custom code to generate a custom operator; and executing the custom operator against the columnar representation of the data to determine query results to be provided in response to the query.
17. The computing system of claim 16, wherein dynamically generating the custom code includes replacing references to a particular field having a particular value with references to a particular row in a column corresponding to the particular field.
18. One or more computer-readable media maintaining instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: annotating individual ones of incoming data events of a stream of incoming data events with a first timestamp and a second timestamp, wherein: the first timestamp identifies when a particular data event in the stream of incoming data events is received; and the second timestamp identifies additional information associated with the particular data event; organizing the stream of incoming data events into a sequence of data batches, wherein individual data batches include multiple data events; and processing the individual data batches in the sequence in a non-decreasing time order that is based on the first timestamp.
19. The one or more computer-readable media of claim 18, wherein the individual data batches store: a payload array that includes an array of all payloads within the particular data batch, wherein the payloads are arranged in a columnar format; a synctime array that includes an array of synctime values of all events in the particular data batch; an othertime array that includes an array of othertime values of all events in the particular data batch; a bitvector that includes an occupancy vector representing an array with one bit; a key array that includes an array of grouping key values; and a hash array that includes an array of hash values.
20. The one or more computer-readable media of claim 18, wherein each incoming data event is annotated with only the first timestamp and the second timestamp.